Efficient Software Decoder Design
Abstract
In this paper, we evaluate several techniques for generating and optimizing high-speed software decoders. We begin by presenting the early stages of a new instruction set description language named ‘Rosetta’. We use specifications written in this language to automatically generate a number of different software decoders. We explore heuristics for generating decoder trees, particularly with regard to enumerating “don’t care” bit positions during evaluation in order to reduce decode tree depth and thus increase performance. We also investigate the application of cache-conscious data placement techniques, decoder structure, and the effects of non-contiguous bit sequences on decoder performance. By applying these techniques to decoders produced for the ARM and IA32 (x86) instruction sets, we are able to produce highly flexible decoders that are comparable in size and performance to carefully hand-coded, hand-optimized decoders, with substantially less programmer time and effort.

Section 1: Introduction

Until fairly recently, efficient decoding of an instruction set in software was not of significant interest beyond the research community, specifically those researchers involved in simulation and testing of new hardware architectures. With the introduction of languages such as Sun’s Java [1] and virtual machine based hardware infrastructures such as Transmeta’s Crusoe processor line [2], the problem of generating efficient software decoders has grown in importance. The fundamental issue in both of these domains is that of high-speed dynamic binary translation. The overall performance of Java’s virtual machine interface and Crusoe’s software translation layer depends on the ability to quickly translate from ‘non-native’ instruction formats to instructions executable on the host architecture. One of the major components of any such binary translation infrastructure is the software decoder.
The ability to quickly identify the non-native instruction is clearly a factor in overall translation speed. In this paper we analyze several approaches to the automatic generation of such high-speed decoders. We believe that the key to decoder performance lies in the ability to divide groups of instructions by evaluating large sequences of contiguous bits, minimizing both the cost of traversing each stage of the decode tree and the average tree depth.

Previous work in this area has focused on contiguous or non-contiguous sequences of bits with fixed values for all instructions at a particular step in the decode process, as opposed to bit positions representing “don’t care” values in some or all instructions (termed ‘unbound bits’ from here on). We believe that dynamically expanding (fully enumerating) unbound positions as needed, so that the instruction set can be divided on longer useful contiguous bit sequences, may provide a substantial performance benefit by decreasing decoder tree depth. Such an optimization can only realistically be performed through automated techniques, as the usefulness of unbound bit positions would be rather difficult for a programmer to evaluate.

In order to facilitate automatic decoder generation, we begin by presenting the early stages of a new ISA specification language called Rosetta. This language is inspired by the regular expression format, and attempts to use a similarly concise form to describe instruction syntax. Our intent is to expand the language to capture semantic information. At this stage, however, Rosetta provides a very straightforward way to describe instruction syntax. This specification is translated into a flattened internal representation, providing an abstraction from the aesthetic qualities of the specification (useful for human understanding), and allowing decoder generation to focus on aspects of the specification that are directly relevant to decoder structure.
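To make the idea of expanding unbound bits concrete, the following minimal sketch divides a small instruction set on a contiguous bit window, placing an instruction with unbound bits under every window value it can match. The four-bit instruction set and the names `expand_unbound` and `split_on_window` are hypothetical, invented for illustration; they are not part of Rosetta.

```python
from itertools import product


def expand_unbound(bits):
    """Enumerate every concrete value of a pattern such as '1-0'.

    '-' marks an unbound ("don't care") position; '0' and '1' are bound.
    """
    choices = [("0", "1") if b == "-" else (b,) for b in bits]
    return ["".join(c) for c in product(*choices)]


def split_on_window(patterns, start, width):
    """Divide instructions by the bits in [start, start + width).

    patterns maps an instruction name to its bit pattern, most significant
    bit first. An instruction with unbound bits inside the window is listed
    under every window value it can match.
    """
    table = {}
    for name, bits in patterns.items():
        for value in expand_unbound(bits[start:start + width]):
            table.setdefault(value, []).append(name)
    return table


# Hypothetical 4-bit instruction set, invented for illustration.
PATTERNS = {
    "add": "00-1",
    "sub": "01-0",
    "ldr": "1-00",
    "str": "1-11",
}

# No two adjacent positions are bound in every instruction, so a decoder
# limited to contiguous, globally bound bits can test one bit at the root.
one_bit = split_on_window(PATTERNS, 0, 1)

# Expanding the unbound second bit of ldr/str lets the root test two bits.
two_bits = split_on_window(PATTERNS, 0, 2)

# Expanding every unbound bit resolves each instruction in a single step.
all_bits = split_on_window(PATTERNS, 0, 4)
```

In this toy set, the one-bit root leaves two instructions in each group, while the fully expanded four-bit window identifies every instruction with one lookup, at the cost of a larger table.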
We begin our analysis by considering the usefulness of allowing unbound bit expansion during generation of the decoder, using a fairly straightforward bit selection heuristic that requires specification of a maximum number of unbound bits. We also consider data placement techniques, reordering decoder state to maximize memory locality of frequently traversed decode nodes. Finally, we evaluate decoder structure, testing data-centered table-based decoders against instruction-centered switch statement decoders. We evaluate the resulting decoders in terms of raw decode performance (determined by the number of processor cycles required for an average decode) and cache performance (through the Cheetah cache simulation library).

From the information gathered in these tests, we attempt to refine our technique. We re-tool the bit selection heuristic to better account for the nature of unbound bits, resulting in better automation and the elimination of the “maximum unbound bits” constraint. We also change the focus of the evaluation to the more efficient switch statement based decoders, and consider the effect of non-contiguous bit sequences.

We begin this evaluation by considering related work in Section 2. Section 3 presents a brief overview of the Rosetta specification language syntax, as well as a discussion of the internal representation produced from the specification. The remainder of the paper discusses bit selection and tree generation techniques. Since the primary focus of this work is efficient bit selection and decoder generation, we present some generic techniques for tree compaction and decoder generation in Sections 4 and 5. These techniques are common to all of the decoder generation heuristics evaluated and are thus presented independently. Section 6 covers the primary aspect of the work, decoder performance evaluation. Finally, Section 7 summarizes and provides concluding remarks.
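The first bit selection heuristic mentioned in this introduction takes a maximum number of unbound bits as a parameter, but its details are not given here. The sketch below is therefore an assumption rather than the authors' algorithm: it picks the longest contiguous window in which no instruction has more than `max_unbound` unbound positions. The function name `select_window` and the four-bit instruction set are hypothetical.

```python
def select_window(patterns, max_unbound):
    """Return (start, width) of the longest contiguous bit window in which
    no instruction has more than max_unbound unbound ('-') positions.

    Ties go to the leftmost window. Returns None if no window qualifies.
    This selection rule is an illustrative assumption.
    """
    length = len(next(iter(patterns.values())))
    best = None
    for start in range(length):
        for width in range(1, length - start + 1):
            worst = max(bits[start:start + width].count("-")
                        for bits in patterns.values())
            if worst > max_unbound:
                break  # widening the window can only add unbound bits
            if best is None or width > best[1]:
                best = (start, width)
    return best


# Hypothetical 4-bit instruction set, most significant bit first.
PATTERNS = {
    "add": "00-1",
    "sub": "01-0",
    "ldr": "1-00",
    "str": "1-11",
}

bound_only = select_window(PATTERNS, 0)   # only globally bound bits
one_unbound = select_window(PATTERNS, 1)  # allow one unbound bit each
```

A practical heuristic would also have to weigh table size, since a window of width w needs up to 2^w entries; this sketch ignores that cost.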

Section 2: Related Work

Several groups have done work in the area of instruction set specification and decoder / simulator generation. Vengroff [3] presents a tool called ‘decgen’ that translates a simple specification format into ISA decoders. The specification language is very straightforward, but does not appear to be expressive enough to easily capture instructions with numerous syntactic forms or optional and variable length fields. For example, in order to express the multiple syntactic forms of a single instruction in the IA32 (x86) ISA, it appears necessary to explicitly describe each valid syntax as a separate instruction description. The decoders produced by decgen are table based, but appear to function by traversing multiple linked tables, making memory location optimizations difficult. The decoders also consider only contiguous sections of globally bound bits, a constraint which we believe can be relaxed, resulting in an increase in decoder performance.

The New Jersey Machine Code Toolkit [4] uses the SLED [5] specification language to generate assemblers and disassemblers. SLED provides a class based description format for specifying field locations and names, with separate pattern statements specifying constraints on bit positions. We believe a description format based on regular expressions can capture both field and pattern information in a more readable format. The toolkit generates decoders by building a decision tree based on token sequences provided by the specification writer. We attempt instead to create an internal abstraction that is completely independent of the nuances of the specification in order to better isolate the syntax information that is relevant to the decode process.

Architecture Description Language (ADL) is a specification language used by the UPFAST simulator generator [6].
It provides a format to describe not only instruction set syntax, but instruction semantics and architectural semantics as well, allowing the automatic creation of complete microarchitecture simulators as well as the obvious assemblers and disassemblers. The syntax specification format used in ADL, however, seems to fall victim to the same potential problems as decgen, and it is somewhat unclear how useful ADL would be in specifying variable length instruction sets or instruction sets with optional fields.

The SimpleScalar Toolset [7] provides syntax and semantic descriptions through definitions written as C macro commands. These macro definitions can be targeted to the requirements of the individual using the toolset. The definitions are translated directly into simulator code at compile time by the C preprocessor. Though this abstraction provides a clean way of describing decoder semantics, it requires that the entire decode tree be specified manually in the definition file, placing the entire burden of building the decoder upon the programmer.

It should be noted that none of the specification tools described above currently make provisions for decoder structure optimization based on usage statistics. Though this is obviously a second-order effect, we wish to explore potential benefits of such optimizations in operation environments with large working sets. Work in the area of cache-conscious data placement and structure layout [8,9] suggests that making such considerations in data placement can improve program performance by reducing cache miss rates. Building on the basic idea of these works, we attempt to exploit knowledge of the problem domain and usage information acquired through profiling to arrange decoder state in a cache-conscious manner.

The Rosetta toolset produces tree based decoders emitted as finite state machines. This allows the toolset to optimize decoders using techniques from both tree processing and DFA optimization.
In particular, as part of their work on efficient path profiling [10], Ball and Larus present a numbering technique using edge increments that produces a unique numeric value for each unique path through an arbitrary tree. This work provided the basis for the annotation method used in the Rosetta tree compaction algorithm. This compaction algorithm attempts to minimize the number of true states used by the DFA through a method related to standard DFA optimization employed by compilers for regular expression DFA generation [11]. The standard technique minimizes states through systematic partitioning of the state space by distinguishing input sequences. The algorithm used by the Rosetta toolset achieves the same result by condensing all identical subtrees of the complete decoder graph, using the Ball/Larus path numbering technique to help generate self-similar subtrees.

Section 3: Rosetta Specification Language

The Rosetta ISA specification language is intended to provide a format for completely describing the syntax and semantics of an instruction set. As a first step toward this goal, we have developed a concise method for defining valid instructions in an ISA. This format was originally inspired by regular expression description format, though the final form has undergone significant alterations to better accommodate the domain. Among these modifications is the inclusion of field names and replacement parameters allowing the designer to easily label specific subsequences for later use or to re-use common sub expressions.

Consider a simple example shown in Figure 3.1 from the ARM7 ISA. This example serves to demonstrate the basic structure of Rosetta’s expression format. An instruction group is entered with a ‘definst’ statement. This block contains minimally a ‘match’ section within which valid instruction formats are described, and optionally a ‘bind’ section which allows for parameterization based on fields named in the match section.
Within the match section, a ‘mainseq’ identifier is used to specify the expression that describes a valid instruction. This main sequence may be preceded by any number of subsequences (denoted ‘subseq’ in later examples), which are effectively macro definitions that may be referenced later.

In this example, both the match and bind sections are fairly straightforward. The match section contains a single sequence description identifying various fields such as registers and instruction flags. These are described simply as a series of bit positions, many of which are just unbound don’t care terms (denoted by the ‘-’). The sequence also specifies certain bit positions that are bound to specific binary values for all instances of the instruction. The bind section does not affect decoding, serving only to parameterize on the OP field for later use in semantic descriptions.

Though this example serves as a good introduction to the specification format used by Rosetta, it does not really demonstrate the power of the language as applied to instruction set specification. To further explore the features provided by this specification technique, we turn to the slightly more challenging example of describing the data processing instructions from ARM7. The match section for this definition is shown in Figure 3.2. Here we see the first use of subsequences to aid in the clarity of the definition. The data processing instructions in ARM7 have two forms, one to utilize a register value as an operand, and one to specify an immediate value. The subsequence format allows the intricacies of these forms to be described in isolation, then introduced into the main matching sequence at their appropriate locations. In this example, the condition field and register specifiers have also been replaced by macros.
These macros, rather than being defined with the instruction block, were previously defined in a global section that has the same semantics as a ‘definst’ block but is used only to hold globally useful parameters or commands and does not resolve to any actual instructions.

This example clearly demonstrates how regular expression type matching can be used to concisely capture specific constraints in the set of valid instructions. Note, for example, the use of the exclusion (‘^’) operator to create specific constraints in the OPMATCH subsequence, and the square bracket shorthand notation for repeating a particular sequence (e.g., ‘[-.5]’ means five don’t care bits).

The other important feature to note is the ‘>’ symbol which appears before many of the field names in the example. This identifier is used to indicate to the Rosetta toolset that the marked field does not actually affect the semantics of the instruction. For example, while the ‘OP’ field is vital to determining the behavior of an instruction (in this case, selecting between addition, subtraction, etc.) and must be enumerated in the evaluation process in order to correctly identify an instruction, the register specifier ‘Rd’ carries no such semantic value. Rather, it serves as a parameter to the instruction and need not be evaluated until an actual executing instruction is available. This information is used as a hint by the toolset while flattening the regular expression sequences representing an instruction (described later).

The distinction between semantically relevant fields and parameter fields leads us to question what actually constitutes an explicit instruction. In the strictest sense, there is no reason for the OP field to be given any more weight than the Rd field during the decode process. One could simply choose to leave the OP field unexpanded, resulting in a decoder that is only able to distinguish groups of instructions.
The job of identifying precisely which instruction is represented by a bit sequence could easily be left to the semantic section, for example by scanning the OP field at run-time. We believe, however, that every attempt should be made to push semantically relevant fields into the decoder. This allows the specification to take advantage of decoder optimization techniques discussed later in this paper, and moves towards isolating instruction selection semantics from instruction execution semantics. We believe this will make the resulting specifications far more concise and far easier to understand.

The opposite end of the argument, expanding all fields during the decode process, also has far less merit than a balanced approach. The claim that parameter fields (such as the Rd field) do not carry the same semantic value as non-parameter fields comes from the observation that the values of such fields are used directly and do not affect the control flow of the instruction. Thus, while there is some benefit in distinguishing between an ADD and a SUBTRACT instruction at the decoder level, there is no benefit in distinguishing between an ADD to R1 and an ADD to R2. Throughout this work, we adopt the philosophy that fields representing semantic information should be expanded, while fields representing instruction parameters or arguments should not.

As a final example, let us consider specification of the Intel IA-32 instruction set, specifically the MOD/RM byte, which presents some rather challenging syntax. The IA-32 specification for Rosetta begins by describing the entire syntax of the MOD/RM, SIB, and displacement bytes as a global subsequence. Since the architecture allows for a 16 bit and 32 bit operating mode, separate specifications are provided for each. The specification of the MOD/RM bytes for the 32 bit

    definst("Multiply") {
        match {
            mainseq = { Cond(----).000000.A(-).S(-).>Rd(----).>Rn(----).
                        >Rs(----).1001.>Rm(----) };
        }
        bind {
            switch(A) {
                case 0: { OP = "mul"; }
                case 1: { OP = "mla"; }
            }
        }
    }

    Figure 3.1: ARM7 Multiply Instruction (Rosetta Specification)
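As a rough illustration of what flattening the main sequence in Figure 3.1 could yield, the sketch below turns the 32-bit pattern into a mask/value pair, matches a word against it, extracts the named fields, and binds OP from the A field as the figure's bind section does. The field names and widths are transcribed from the figure; the mask/value representation and the names `flatten` and `decode_multiply` are assumptions made for this example, not Rosetta's actual internal representation.

```python
def flatten(bits):
    """Turn a pattern of '0', '1', '-' (MSB first) into (mask, value).

    A word w matches the pattern when (w & mask) == value.
    """
    mask = int("".join("0" if b == "-" else "1" for b in bits), 2)
    value = int("".join("1" if b == "1" else "0" for b in bits), 2)
    return mask, value


# Fields of the Figure 3.1 main sequence, most significant bits first.
# Entries named None are bound bits that carry no field name.
FIELDS = [
    ("Cond", "----"),
    (None, "000000"),
    ("A", "-"),
    ("S", "-"),
    ("Rd", "----"),
    ("Rn", "----"),
    ("Rs", "----"),
    (None, "1001"),
    ("Rm", "----"),
]

PATTERN = "".join(bits for _, bits in FIELDS)
MASK, VALUE = flatten(PATTERN)


def decode_multiply(word):
    """Return the named fields of a 32-bit word, or None if it does not match."""
    if (word & MASK) != VALUE:
        return None
    fields = {}
    shift = len(PATTERN)
    for name, bits in FIELDS:
        shift -= len(bits)
        if name is not None:
            fields[name] = (word >> shift) & ((1 << len(bits)) - 1)
    # Mirrors the figure's bind section: A selects the operation name.
    fields["OP"] = "mla" if fields["A"] else "mul"
    return fields


mul = decode_multiply(0xE0000291)    # A = 0
mla = decode_multiply(0xE0203291)    # A = 1
other = decode_multiply(0xE0800001)  # bound bits differ, so no match
```

Only the A field changes which operation is selected; the fields marked ‘>’ in the figure are simply extracted as parameters, which is the distinction between semantic and parameter fields drawn above.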